Abstract

This paper proposes a hierarchical multiprocessor architecture for
real-time programs written in a concurrent programming language. The
use of processes and monitors leads to a multiprocessor system in
which each processor has a local store dedicated to a single process.
The processors share a common store that contains the monitors. To
avoid congestion in the common store the processes and monitors are
partitioned into subsystems that share a hierarchy of common stores.
The main goal is to develop a synthesis of an abstract language and
a computer architecture that match in an obvious way.

language for concurrent programming.

The proposal rests on the assumption that simplicity, reliability,
and efficiency are essential for real-time applications. Without
simplicity one cannot expect to understand the purpose and details
of a large program. Without reliability one cannot seriously depend
on it. And without efficiency a real-time program cannot keep pace
with its environment.

Although efficiency is important we will not let it compromise the
vital need for simplicity. Where such a conflict exists, we will
settle for a simple system that can handle many (but not all)
real-time applications.

These  programming  goals  can  be  met  by  careful  design  of  the 
programming  language,  the  compiler,  and  the  computer  architecture: 

(1)  To  obtain  simplicity  real-time  programs  must  be  written  in 
an  abstract  language  that  supports  modular  programming. 

(2) To make real-time programs reliable the programming language
must permit extensive compilation checks that ensure the integrity
of modules.

(3) To make real-time programs efficient they must be executed by a
multiprocessor system in which each processor is dedicated to a
single process.

(4) To make the language implementation straightforward the
multiprocessor architecture must support the language concepts in
an obvious way.

(5) To make the multiprocessor inexpensive it must use
microprocessor technology.

Initially, we looked at programming languages in which all
communication between concurrent processes takes place by messages
only. This approach seemed natural for microprocessor networks
without common store modules. Message systems with completely
deterministic input/output behavior have existed for more than a
decade. But a recent proposal has pointed out an occasional need
for nondeterministic process communication [1]. Since these ideas
are still at a very early stage, we felt that it would be more
appropriate for us to experiment in only one area at a time.

Now we do have several years of experience with the programming
language Concurrent Pascal that includes processes and monitors
[2]. So we decided to look for a multiprocessor architecture that
supports a language of this type directly.

In  any  multiprocessor  system,  processes  need  to  communicate  to 
cooperate  on  common  tasks.  Each  process  uses  some  portion  of  the 
transmission  capacity  of  the  communication  lines  (which  may  be 
bus  lines  or  common  stores).  As  more  processes  are  connected  to 
the  same  line,  a point  is  soon  reached  where  that  line  becomes  a 
bottleneck. 

So a key problem is to avoid congestion of the communication lines.
To do that one must take advantage of the locality of store
references within a concurrent program. One may, for example,
observe the program during its execution to discover patterns of
store references. Demand-paging systems and cache stores are typical
examples of this approach. Both require complicated run-time
mechanisms.

We will instead depend on the compiler to exploit the locality of
references that is determined by the modularity of the programming
language. The use of processes and monitors naturally leads to a
multiprocessor system in which each processor has a local store
dedicated to a single process and in which several processors share
a common store that contains the monitors.

Although performance measurements are scarce for concurrent
programs, preliminary estimates for Concurrent Pascal programs
suggest that each process refers to its own code and variables an
order of magnitude more often than it refers to the monitors. The
simplest architecture proposed here therefore consists of about 10
processors, each with its own store and sharing a single common
store.

For  systems  with  up  to  100  processors  we  propose  a block- 
structured  programming  language  that  maps  directly  onto  a 
multiprocessor  architecture  with  a hierarchy  of  common  stores. 

We would expect that more general multiprocessor systems, such as
the Cm*, can be configured to handle the special cases presented
here [3]. Our purpose, however, is to take full advantage of the
characteristics of a concurrent programming language and develop a
synthesis of a language and an architecture that match in an
obvious way. This viewpoint leads to many simplifying assumptions
that should have a beneficial effect on the cost and reliability of
the hardware. It is our view that the search for utter simplicity
and full generality are complementary (but often conflicting)
research goals. Both approaches should be tried before any firm
conclusions can be made about their merits.

2.  Programming  concepts 

Most complex systems in nature are organized hierarchically as
subsystems which in turn may be further subdivided [4]. In such
systems, each subsystem spends most of its time performing an
autonomous function and uses much less time interacting with a few
other subsystems. The exact timing of events in each subsystem is
independent of the exact timing of other subsystems. In the short
run the behavior of each subsystem is independent of other
subsystems. In the long run a subsystem is influenced only by the
average behavior of other subsystems.




In the design of a real-time computer system one can take advantage
of the hierarchical nature of the world by identifying subtasks
that are nearly autonomous and which are loosely connected to one
another. The great advantage of such systems is that they can be
designed, tested, and tuned piecemeal by focussing the attention on
one subsystem at a time.

A real-time application will be controlled by a concurrent program
that runs on a multiprocessor system dedicated to that application.
The subtasks will be performed by a fixed number of asynchronous
processes that are executed simultaneously. The processes
communicate by means of a fixed number of monitors [5, 6].

Figure 1 shows an example of two processes that communicate by
means of a buffer monitor. The arrows indicate that these processes
have access to that monitor.


[Figure: buffer monitor; sender process; receiver process]

Fig. 1. A hierarchical subsystem

The  following  shows  how  the  buffer  monitor  can  be  programmed: 


The monitor defines a common data structure consisting of a message
slot and a boolean indicating whether or not it is full. The
monitor also defines two operations, send and receive, on the
buffer and an initial statement that makes it empty to begin with.

Processes can perform the send and receive operations on the buffer
but cannot access the data structure directly. This is guaranteed
by the compiler.

The operations on a monitor take place strictly one at a time. When
a process performs a monitor operation the computer will delay
further operations on the same monitor until the current operation
is finished. These short-term delays are implicit in the monitor
concept.

A monitor operation may, however, postpone its own completion until
the common variables satisfy a certain condition. These medium-term
delays are expressed in the language by means of a statement of the
form [7]:

    when boolean expression: statement end

This delays the execution of the next statement until a boolean
expression becomes true (as the result of another monitor
operation).

When a monitor operation delays itself explicitly another monitor
operation can take place. The delayed operation is resumed when its
precondition is satisfied and no other operation is in progress.
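
The buffer monitor described above can be approximated in an
ordinary threaded language. The following is only a sketch, assuming
Python's threading.Condition as a stand-in for the monitor's mutual
exclusion and for the when statement; the class name BufferMonitor,
the fields slot and full, and the helper run_demo are illustrative
names, not the paper's notation.

```python
import threading


class BufferMonitor:
    """One-slot buffer; the Condition's lock gives one-at-a-time operations."""

    def __init__(self):
        self._cond = threading.Condition()
        self._slot = None
        self._full = False  # initial statement: the buffer starts empty

    def send(self, x):
        with self._cond:
            # Plays the role of "when not full: ... end". While delayed,
            # the lock is released so another operation can take place.
            self._cond.wait_for(lambda: not self._full)
            self._slot = x
            self._full = True
            self._cond.notify_all()

    def receive(self):
        with self._cond:
            # Plays the role of "when full: ... end".
            self._cond.wait_for(lambda: self._full)
            y = self._slot
            self._full = False
            self._cond.notify_all()
            return y


def run_demo(messages):
    """Run one sender and one receiver; return what the receiver got."""
    buf = BufferMonitor()
    received = []

    def sender():
        for m in messages:
            buf.send(m)

    def receiver():
        for _ in messages:
            received.append(buf.receive())

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Unlike the proposed machine, this sketch multiplexes both processes
on whatever processors the host provides; it illustrates only the
synchronization rule, not the architecture.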

The boolean expressions used for synchronization are more elegant,
but less efficient than the queue variables used in Concurrent
Pascal [2, 8]. The low cost of microprocessor technology should,
however, make the more elegant concept the obvious choice.

The example below shows two concurrent processes that communicate
by means of the monitor buffer:

    process sender
    var x: message
    cycle produce(x); buffer.send(x) end

    process receiver
    var y: message
    cycle buffer.receive(y); consume(y) end


Each process defines a local data structure consisting of a single
message and a cyclical statement that operates on it.

The variables of one process are inaccessible to other processes.
This too is guaranteed by the compiler. The checking of access
rights during compilation makes a hardware protection system
unnecessary [2].

The processes are loosely connected if they spend most of their
time operating on their local data structures and very little time
on exchanging data. Loosely coupled processes spend only a fraction
of their time within monitors.


A monitor can also be used to control the access to common
resources, such as a peripheral device. The simplest example of a
resource scheduler is shown below:

    monitor resource
    var free: boolean

    procedure request
    when free: free := false end

    procedure release
    begin free := true end

    begin free := true end


The resource is free to begin with. A request operation delays the
calling process until the resource is free and makes it available
exclusively for that process. A release operation makes the
resource free again.
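
The scheduler can be sketched in the same spirit. This is an
illustration only, again assuming Python's threading.Condition in
place of the monitor; the class name Resource and the helper
exclusive_use_demo are hypothetical.

```python
import threading


class Resource:
    """Scheduler for a single resource, modelled on the monitor above."""

    def __init__(self):
        self._cond = threading.Condition()
        self._free = True  # initial statement: the resource is free

    def request(self):
        with self._cond:
            # Delay the caller until the resource is free, then claim it.
            self._cond.wait_for(lambda: self._free)
            self._free = False

    def release(self):
        with self._cond:
            self._free = True
            self._cond.notify_all()


def exclusive_use_demo(workers=4, rounds=50):
    """Return the largest number of workers ever holding the resource."""
    res = Resource()
    state = {"inside": 0, "max_inside": 0}

    def worker():
        for _ in range(rounds):
            res.request()
            state["inside"] += 1
            state["max_inside"] = max(state["max_inside"], state["inside"])
            state["inside"] -= 1
            res.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["max_inside"]
```

If the scheduler works, the demo never observes more than one
worker between request and release.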

A study of three model operating systems written in Concurrent
Pascal shows that most monitors serve either as buffers or as
schedulers.

In this discussion the details of the programming language are not
important. We will just assume that a concurrent program consists
of a fixed set of monitors M followed by a fixed set of processes
P1, P2, ..., Pn. The monitors are initialized before the processes
are executed.

3. Single-processor system

We will describe three computer architectures for a concurrent
programming language. These computers will use very similar methods
of store allocation, but will gradually increase the degree of
multiprocessing from 1 to 10 and, finally, to 100.

A single-processor  system  that  has  already  been  implemented 
is  described  first.  Its  purpose  is  to  introduce  the  storage 
allocation  scheme  and  to  characterize  the  performance  of  a 
single  processor  programmed  in  Concurrent  Pascal. 

On a single processor with a single store the code and variables of
all monitors can be stored in a single segment M. The code and
variables of each process can be stored in single segments P1, P2,
..., Pn (Fig. 2).

[Figure: common store; processor]

Fig. 2. Single-processor system


The compiler ensures that processes only communicate by means of
monitors. So each process only needs access to a subset of the
whole store. This is important on a processor with a short word
length (16 bits). It makes it possible to extend the store beyond
the addressing limit by means of an address map that enables each
process to see a virtual store consisting of the common segment M
and its local segment Pi. This corresponds closely to the
implementation of Concurrent Pascal on the PDP 11/45 computer [2].

On a single processor concurrent processes are simulated by means
of clock interrupts. A process has exclusive access to the common
segment as long as it performs a monitor operation without delaying
itself. This is ensured by disabling clock interrupts temporarily.
Processes that are ready to continue or are waiting for conditions
to be satisfied are rescheduled periodically in round-robin order.
This is the classical technique of processor multiplexing.

The Concurrent Pascal compiler generates virtual code for a stack
machine. This virtual code is interpreted by a machine program of
1 K words on the PDP 11/45.

So far, three model operating systems have been written in
Concurrent Pascal. The largest one is the Solo system which
requires a store of 39 K words [2]. The monitors occupy a common
segment of about 4 K words while the process segments vary from
200 words to 20 K words each.

One of our main concerns will be to avoid congestion of the common
stores in a multiprocessor system. To illustrate the problem we
will consider text processing as an example that involves a fair
amount of data transmission between processes and a minimal amount
of data processing. The processing of a stream of input data is
common in real-time applications (although the data items may
represent measurements rather than characters).

In the Solo system, the internal speed of text processing is about
1000 - 3000 char/sec for lexical analysis and line-oriented
editing. In performing these tasks the system spends less than 10
per cent of its time on monitor operations. So the processes are
certainly loosely connected.

4. A two-level multiprocessor

Figure 3 shows a proposed multiprocessor system with n
microprocessors and n + 1 store modules. Each processor is
dedicated to the execution of a single process. When a process
delays itself its processor is also idle. A processor has a local
store that holds the code and variables of a single process. The
processors are connected to a single common store that holds the
code and variables of all monitors.

[Figure: common store (monitors); local stores (processes);
processors]

Fig. 3. A two-level multiprocessor


The virtual address space of a single processor consists of its
local store and the common store. A processor can access its own
store directly (but no other processor can). Access to the common
store is controlled by a round-robin arbiter. Initially, we will
assume that a processor has exclusive access to the common store
for the duration of a monitor operation (or until the operation
delays itself). This is achieved by request and release
microinstructions on the common arbiter. From the point of view of
a processor, its local store and the common store operate at the
same speed (say, 3 μsec/word).

A peripheral device can be attached either to the local store of a
single processor or to the common store. A processor remains idle
during an input/output operation. This makes input/output appear to
be an indivisible operation and eliminates the need for interrupts.

Fast  response  to  external  events  is  guaranteed  by  using 
dedicated  processors  that  respond  immediately  to  these  events. 

As long as the common store is used primarily for internal
communication (but is not a critical factor in the immediate
response to real-time events) it is quite acceptable that it is
monopolized by a single processor during a monitor operation. Later
we will relax things a bit and permit operations on different
monitors to take place simultaneously.

To make an evaluation of such a system we will assume that the
processor and store technology is comparable to that of the LSI 11
microcomputers. If the virtual code generated by the compiler must
be executed by an interpreter written in machine code then each
processor will be 3 times slower than the PDP 11/45 [9]. But, if
the virtual code is executed directly by microprograms, then it
should be slightly faster than the interpreted code on the PDP
11/45. We will assume the latter and use the known performance
figures for Concurrent Pascal statements [2] reduced somewhat to
take the absence of processor multiplexing into account.

The send and receive operations on the buffer defined earlier will
then take about (70 + 3c) μsec each, where c is the number of
characters per message. So it will take 0.7 msec to transmit a
block of 100 characters from one process to another (one send
followed by one receive). If the common store is used only for
transmitting blocks of this size by means of monitors then it has
a capacity of 135,000 char/sec.
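
The arithmetic behind these figures can be checked with a short
calculation. This is a sketch under one assumption: the fixed cost
per operation is taken to be 70 μsec, since the constant is only
partly legible in the source and 70 is the value that reproduces
the stated capacity of about 135,000 char/sec. The function names
are illustrative.

```python
def monitor_op_usec(c, fixed_usec=70, per_char_usec=3):
    """Cost of one send or one receive for a message of c characters."""
    return fixed_usec + per_char_usec * c


def transfer_usec(c):
    """A transfer between two processes is one send plus one receive."""
    return 2 * monitor_op_usec(c)


def capacity_chars_per_sec(c):
    """Characters per second if the common store does nothing but transfers."""
    return c / (transfer_usec(c) * 1e-6)
```

For c = 100 the transfer takes 740 μsec, which the text rounds to
0.7 msec.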

To evaluate a case where the total traffic through the common store
is high we assume that all the processors operate on characters at
the highest possible rate. Consider therefore a machine with 10
processors and assume that each processor has a throughput of 3000
char/sec - a total throughput of 30,000 char/sec. This means that
each processor uses the common store only 2 per cent of the time
while all of them use it for only 20 per cent of the time. It is
this strong localization of references to the local stores which
makes the two-level system practical.
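
The utilization figures follow from dividing the demanded
throughput by the capacity of the common store. A minimal sketch,
with an illustrative function name:

```python
def common_store_utilization(processors, chars_per_sec_each,
                             capacity_chars_per_sec):
    """Return (share used by one processor, share used by all of them)."""
    per_processor = chars_per_sec_each / capacity_chars_per_sec
    return per_processor, processors * per_processor


per_processor, total = common_store_utilization(10, 3000, 135000)
```

The exact values are slightly above the rounded 2 and 20 per cent
quoted in the text.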

With a utilization factor of only 0.2 each processor will on the
average immediately get access to the common store when it
needs  it  M . Since  the  common  store  only  comprises  9 per  cent 
of the whole store, its low utilization does not matter.

In practice, it seems unlikely that an application will make it
possible for 10 processors to operate continuously without idle
periods. So the common store should certainly have sufficient speed
for such a system.

In some applications processes might interact more often than in
the previous example. In that case, it may be necessary to reduce
the number of processes connected to a single common store somewhat
and use a hierarchy of common stores as described later. However,
if each process spends more than 10 per cent of its time in a
common store, then we have a tightly coupled system which the
machine was not built to handle.

Identical processes that operate in unison on common arrays are an
important example of tight coupling. Another example is a pipeline
that performs very fast operations on a stream of small data items.
These special applications require a different kind of computer
architecture which may be highly specialized if extreme speed is
required [10]. This paper concentrates on computer architectures
for a wide variety of applications in which different processes
operate asynchronously at medium speeds. It may well be feasible to
handle fast real-time applications by connecting a general-purpose
multiprocessor to special-purpose processors for data reduction or
synchronous computation.

5. The overhead of synchronization

So far we have ignored the problem of the periodic evaluation of a
boolean expression B within a delayed monitor operation.

When a process attempts to execute the synchronization statement

    when B: S end

the process is in the middle of a monitor operation and has
exclusive access to the common store. The code generated for the
when statement will therefore be equivalent to the following
program piece:

    while not B: release; request end

This code sequence evaluates the boolean expression. As long as it
is false the processor releases the common store again and waits
for another turn. When another process has completed a monitor
operation that makes the expression true, the statement S will be
executed. The request and release operations refer to the arbiter
that gives a processor exclusive access to the common store.
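
The reevaluation loop can be imitated in software. The following is
a sketch only: a Python threading.Lock stands in for the hardware
arbiter, and a short sleep stands in for waiting for another turn,
since a Python lock is not round-robin. The names arbiter, when and
demo are illustrative.

```python
import threading
import time

arbiter = threading.Lock()  # stands in for the common-store arbiter


def when(condition, statement):
    """Busy form of the when statement.

    The caller must already hold the arbiter, as it is in the middle
    of a monitor operation.
    """
    while not condition():
        arbiter.release()   # release: let other processors in
        time.sleep(0.001)   # give other threads a turn
        arbiter.acquire()   # request: wait for another turn
    statement()


def demo():
    state = {"full": False, "slot": None, "got": None}

    def take():
        state["got"] = state["slot"]
        state["full"] = False

    def receive():
        arbiter.acquire()
        when(lambda: state["full"], take)
        arbiter.release()

    receiver = threading.Thread(target=receive)
    receiver.start()
    time.sleep(0.01)        # let the receiver start polling first

    arbiter.acquire()       # the send operation
    state["slot"] = "msg"
    state["full"] = True
    arbiter.release()

    receiver.join()
    return state["got"]
```

Every pass through the loop occupies the common store, which is the
overhead examined in the rest of this section.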

Typical synchronizing conditions, such as

    not full     length < max     user in turn

will take 30-100 μsec each time they are reevaluated.

Previously, we considered the extreme case in which 10 processors
were running continuously without delay. We will now examine the
other extreme case in which 9 processes are waiting for different
conditions to be satisfied. When the last process executes a send
operation on a buffer it becomes possible for one of the waiting
processes to complete a receive operation.

In that case, the send and receive operations can now both be
delayed by the reevaluation of 8-9 conditions of 100 μsec each.
This increases the response time of the interaction from 0.7 to 2.4
msec in the worst case. However, at the processing rate of 3000
char/sec, the exchange of a block of 100 characters only takes
place every 33 msec. The only effect of the 2.4 msec is to slow the
processes down by 7 per cent.
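
The worst-case figures can be reproduced as follows. This is a
sketch under an assumption: the send and the receive are together
delayed by 17 reevaluations of 100 μsec, which is the count that
yields the stated 2.4 msec; the source gives the number of
conditions per operation only as a small range. The function names
are illustrative.

```python
def worst_case_response_msec(base_msec=0.7, reevaluations=17,
                             cost_usec=100):
    """Response time of one send/receive interaction, in msec."""
    return base_msec + reevaluations * cost_usec / 1000.0


def slowdown(response_msec, block_chars=100, rate_chars_per_sec=3000):
    """Fraction of each block period spent on the interaction."""
    period_msec = block_chars / rate_chars_per_sec * 1000.0
    return response_msec / period_msec
```

A block of 100 characters at 3000 char/sec arrives every 33 msec,
so 2.4 msec amounts to roughly 7 per cent of the period.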

Since the arbiter interleaves the reevaluation of conditions with
new monitor calls in round-robin order, the amount of reevaluation
is automatically reduced when the traffic intensity increases.

Nevertheless, the reevaluation still has several undesirable
consequences: (1) it makes the coupling of interacting processes
unnecessarily tight at times; (2) it slows all other monitor
operations down; and (3) it forces the programmer to use only the
simplest possible conditions (rather than the most natural ones).
It would therefore be desirable to limit the effects of
reevaluation.



A radical (but expensive) solution to the problem of reevaluation
is to use a common store that is 10 times faster than a local store
and use local buffer registers to interleave references to single
words in the common store. If we restrict the machine to 10
processors, the common store will always appear to be as fast as
the local store of each processor. So the effect of reevaluation
disappears.

Since we are now interleaving operations on different monitors
simultaneously, we need a separate arbiter for each monitor. Each
arbiter will be represented by a gate variable in the common store.
The lock and unlock operations on a gate variable will be performed
by microprograms. These operations are made indivisible by means of
request and release operations on the single hardware arbiter:

    lock(gate): request
                while gate = 0: release; request end
                gate := 0
                release

    unlock(gate): request; gate := 1; release
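
The gate operations can be sketched in software as follows. This is
an illustration, not the microprogram: a Python threading.Lock
stands in for the single hardware arbiter, a gate value of 1 is
assumed to mean open (the source does not state the initial value),
and the short sleep only keeps the polling loop from hogging the
interpreter. The names Gate and demo are hypothetical.

```python
import threading
import time

arbiter = threading.Lock()  # the single hardware arbiter


class Gate:
    """One gate variable per monitor; 1 = open, 0 = closed (assumed)."""

    def __init__(self):
        self.value = 1


def lock(gate):
    arbiter.acquire()            # request
    while gate.value == 0:
        arbiter.release()        # release
        time.sleep(0.0005)       # let other threads run
        arbiter.acquire()        # request
    gate.value = 0
    arbiter.release()            # release


def unlock(gate):
    arbiter.acquire()
    gate.value = 1
    arbiter.release()


def demo(workers=4, rounds=50):
    """Several threads update a counter guarded by one gate."""
    gate = Gate()
    counter = {"n": 0}

    def worker():
        for _ in range(rounds):
            lock(gate)
            counter["n"] += 1
            unlock(gate)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["n"]
```

Because the arbiter is held only for the few steps of lock and
unlock, two different gates can be closed at the same time, which
is what permits operations on different monitors to overlap.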

If the common and local stores have the same access times the
following scheme will limit the overhead of synchronization. Each
monitor is now represented by two common variables, called the gate
and the clock. The gate variable controls arbitration by means of
lock and unlock operations as before. The clock is incremented
cyclically by one every time an operation has changed the monitor
variables. The clock need only be incremented at the beginning of
each when statement and at the end of each monitor procedure.

Instead of reevaluating a complicated boolean expression B
repeatedly a process can now simply look at the monitor clock until
it changes its value. Only then is it necessary to evaluate the
boolean expression again. This leads to the following code sequence
for a when statement:

    increment(clock)
    while not B:
      unlock(gate)
      awaitchange(clock)
      lock(gate)
    end


A delayed process stores the current value of the clock in a local
register and compares it with the clock variable every 100 μsec.
So the awaitchange instruction is microprogrammed as follows:

    register := clock
    while register = clock: idle(100) end

If we assume that load and store operations on the common store
take place one at a time (thanks to the hardware arbiter), then it
is not necessary for a delayed process to lock the gate variable
just to look at the clock value. Consequently, the idling and
reexamination of the clock variable only requires one cycle of the
common store (or 3 μsec) every 100 μsec.

As long as the state of a monitor is unchanged a delayed process
can at most consume 3 per cent of the common store cycles. The
reevaluation of a synchronizing condition B only takes place when
another process changes the state of a monitor (and its clock). In
the text processing example considered earlier an expression is
evaluated in 100 μsec every 33 msec - an overhead of only 0.3 per
cent. Within certain limits the overhead of synchronization is now
influenced very little by the complexity of the expressions used.
The existence of the gate and clock variables is hidden from the
programmer.
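
The clock scheme can be sketched as follows. The sketch makes
several assumptions that are not in the source: a Python
threading.Lock stands in for the gate, the clock wraps at 2^16, and
the clock is read before the gate is released (the source loads the
register inside awaitchange) so that a change made immediately
after the unlock is not missed. The names Monitor, enter, leave and
demo are illustrative.

```python
import threading
import time


class Monitor:
    """Gate and clock of one monitor."""

    CLOCK_MODULUS = 1 << 16  # assumed wrap-around of the cyclic clock

    def __init__(self):
        self._gate = threading.Lock()  # stands in for lock/unlock(gate)
        self.clock = 0
        self.evaluations = 0           # condition evaluations, for illustration

    def _tick(self):
        self.clock = (self.clock + 1) % self.CLOCK_MODULUS

    def enter(self):
        self._gate.acquire()

    def leave(self):
        self._tick()                   # end of a monitor procedure
        self._gate.release()

    def when(self, condition):
        self._tick()                   # beginning of a when statement
        while True:
            self.evaluations += 1
            if condition():
                return
            seen = self.clock          # register := clock
            self._gate.release()       # unlock(gate)
            while seen == self.clock:  # awaitchange(clock)
                time.sleep(0.0001)     # idle(100), in seconds here
            self._gate.acquire()       # lock(gate)


def demo():
    m = Monitor()
    state = {"full": False, "slot": None, "got": None}

    def receiver():
        m.enter()
        m.when(lambda: state["full"])
        state["got"] = state["slot"]
        state["full"] = False
        m.leave()

    t = threading.Thread(target=receiver)
    t.start()
    time.sleep(0.02)                   # receiver is now watching the clock

    m.enter()                          # the send operation
    state["slot"] = "msg"
    state["full"] = True
    m.leave()

    t.join()
    return state["got"], m.evaluations
```

However long the receiver waits, it evaluates its condition only
when the clock has changed, so the cost of a complicated condition
is paid rarely.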

The interleaving of common store references also makes it practical
to connect a small number of shared peripherals to the common store
(although we expect that most devices will be connected to the
local store of a microprocessor).

6. A hierarchical multiprocessor


For real-time applications that require more than 10 processors we
propose a block-structured programming language in which subsets of
processes and monitors can be grouped into a hierarchy of
subsystems.

A concurrent program now consists of nested subsystems (Fig. 4).
Each subsystem in turn contains a set of monitors and processes.

A process can use only those monitors that are within its own
subsystem and within the subsystems that enclose it. In figure 4
the outer subsystem consists of the set of monitors M0. The first
inner subsystem consists of the set of monitors M1 and the
processes P1, P2, ..., Pm. These processes can use the local
monitors M1 and the global monitors M0. Similarly, a process Qi in
the other inner subsystem can use its local monitors M2 and the
global monitors M0. But a process within one of the inner
subsystems cannot use a monitor within the other inner subsystem.
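
The access rule amounts to a walk from a subsystem to the root of
the nesting. The following sketch illustrates it at run time,
although in the proposal the rule is checked by the compiler. The
class Subsystem is hypothetical, and the names M0, M1 and M2 follow
Fig. 4 as far as it can be read.

```python
class Subsystem:
    """A set of monitors nested inside an optional enclosing subsystem."""

    def __init__(self, monitors, parent=None):
        self.monitors = set(monitors)
        self.parent = parent

    def visible_monitors(self):
        """Monitors usable by a process declared in this subsystem."""
        node, seen = self, set()
        while node is not None:
            seen |= node.monitors
            node = node.parent
        return seen

    def may_use(self, monitor):
        return monitor in self.visible_monitors()


outer = Subsystem({"M0"})
inner1 = Subsystem({"M1"}, parent=outer)   # holds P1, ..., Pm
inner2 = Subsystem({"M2"}, parent=outer)   # holds Q1, ..., Qn
```

A process in inner1 sees M1 and M0 but not M2, and the same holds
symmetrically for inner2.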


[Figure: outer subsystem M0 enclosing two inner subsystems, M1 with
processes P1, ..., Pm and M2 with processes Q1, ..., Qn]

Fig. 4. A hierarchy of subsystems

We assume that each subsystem uses its own monitors an order of
magnitude more frequently than it uses the global monitors.

Programs written in such a language can be executed by a
multiprocessor with a hierarchy of common stores (Fig. 5).


[Figure: common store; common stores; processors]

Fig. 5. A hierarchical multiprocessor


Although  the  horizontal  lines  can  be  interpreted  as  bus  lines, 
Fig.  5 is  not  a diagram  of  the  connections  of  hardware  modules  to 
bus  lines.  It  is  a diagram  of  the  access  rights  of  processors  to 
store  modules. 

The virtual store of each processor consists of its local store and
all the common stores that lie on a path from the processor to the
root of the storage tree. For process P1 the common stores are M1
and M0. When P1 refers to M0 it has exclusive access to both M1 and
M0. The hierarchical usage of arbiters prevents some
deadlocks  [sj  . 

Although the multiprocessor in Fig. 5 seems to be tailored to the
program sketched in Fig. 4, it should really be viewed as a
general-purpose machine that can execute any concurrent program
with one or two subsystems which in turn are divided into no more
than 10 processes each.

For a 16 bit processor it seems reasonable to have a three-level
machine where the common stores contain 8 K words each and the
local stores contain 16 K words each. Such a machine could include
a total of 100 processors and 1.6 M words. A 32 bit processor
system could have more processors and more store levels. It seems
likely, however, that special-purpose machines are needed to
utilize a much higher degree of concurrency efficiently.
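
The sizing of the three-level machine can be checked with a short
calculation. This is a sketch that assumes a fan-out of 10 at each
level (suggested by the limit of 10 processes per common store) and
one root common store plus one common store per second-level
subsystem; the function name is illustrative.

```python
K = 1024


def three_level_machine(fanout=10, common_words=8 * K,
                        local_words=16 * K):
    """Return (processors, total store in words, virtual store in words)."""
    processors = fanout * fanout
    common_stores = 1 + fanout  # the root plus one per subsystem
    total_words = (common_stores * common_words
                   + processors * local_words)
    # A processor sees its local store and the two common stores on
    # its path to the root of the storage tree.
    virtual_words = local_words + 2 * common_words
    return processors, total_words, virtual_words


processors, total_words, virtual_words = three_level_machine()
```

Under these assumptions the local stores alone account for 1600 K
words, in line with the 1.6 M words quoted, and each processor sees
a virtual store of 32 K words. The source does not say why these
sizes were chosen; 32 K words of 16 bits happens to be 64 K bytes.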


7. Final remarks


The recent reduction of hardware costs for microprocessors will
soon put great pressure on software designers to reduce their costs
as well. The only way to do that is to write all software in
abstract programming languages that hide the irrelevant details of
computers. To make an abstract language efficient enough for
real-time applications one must design a computer architecture that
supports the language features directly.

Ten years ago this approach led to the development of stack
machines for sequential programming languages. This paper suggests
that a multiprocessor system with hierarchical storage will support
a concurrent programming language with processes and monitors
efficiently. Machines with stacks and tree-structured storage
exploit the scope rules of the programming language to share
storage efficiently among program modules.

The paper describes a reasonably simple way of limiting the
reevaluation of synchronizing expressions within monitors. It also
proposes a block-structured language concept (called a subsystem)
which enables the programmer to partition the data structures of a
concurrent program hierarchically among asynchronous processes.

It  needs  to  be  stressed  again  that  this  paper  is  only  a 
proposal  for  a multiprocessor  system  that  has  not  been  built 
yet. 

This is an outline of the syntax of a concurrent programming
language with nested subsystems containing monitors and processes